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Key words Abstract The aim of this study was to assess potential candidate gene regions and corresponding universal 
DNA barcoding primer pairs as secondary DNA barcodes for the fungal kingdom, additional to ITS rDNA as primary barcode. 
ITS supplement Amplification efficiencies of 14 (partially) universal primer pairs targeting eight genetic markers were tested across 
molecular taxonomy > 1 500 species (1 931 strains or specimens) and the outcomes of almost twenty thousand (19 577) polymerase 
phylogeny chain reactions were evaluated. We tested several well-known primer pairs that amplify: i) sections of the nuclear 
species identification ribosomal RNA gene large subunit (D1—D2 domains of 26/288); ii) the complete internal transcribed spacer region 
universal primers (ITS1/2); iii) partial B-tubulin II (TUB2); iv) y-actin (ACT); v) translation elongation factor 1-a (TEF 1a); and vi) the 
second largest subunit of RNA-polymerase II (partial RPB2, section 5—6). Their PCR efficiencies were compared 
with novel candidate primers corresponding to: i) the fungal-specific translation elongation factor 3 (TEF3); ii) a 
small ribosomal protein necessary for t-RNA docking; iii) the 60S L10 (L1) RP; iv) DNA topoisomerase | (TOPI); 
v) phosphoglycerate kinase (PGK); vi) hypothetical protein LNS2; and vii) alternative sections of TEF1a. Results 
showed that several gene sections are accessible to universal primers (or primers universal for phyla) yielding a 
single PCR-product. Barcode gap and multi-dimensional scaling analyses revealed that some of the tested candidate 
markers have universal properties providing adequate infra- and inter-specific variation that make them attractive 
barcodes for species identification. Among these gene sections, a novel high fidelity primer pair for TEF 1a, already 
widely used as a phylogenetic marker in mycology, has potential as a supplementary DNA barcode with superior 
resolution to ITS. Both TOPI and PGK show promise for the Ascomycota, while TOPI and LNS2 are attractive for 
the Pucciniomycotina, for which universal primers for ribosomal subunits often fail. 
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INTRODUCTION 


Identification and classification of eukaryotes increasingly de- 
pends on DNA sequences of standardised genetic markers, 
a concept known as DNA barcoding (Hebert et al. 2003a, b, 
Hebert & Gregory 2005, Meyer & Paulay 2005, Schindel & Miller 
2005, Schoch et al. 2012). An intense debate is ongoing con- 
cerning whether the identification of organisms of unresolved 
alpha taxonomy is amenable to DNA barcoding, because in sili- 
co-based identification requires gene sequences that accurately 
reflect natural classifications (Eberhardt 2010, Schlick-Steiner 
et al. 2010). If this prior condition is not adequately fulfilled a 
posteriori nesting of ‘unknowns’ among known fungal taxa is 
impossible. DNA barcoding has evolved, despite its unresolved 
theoretical and taxonomic issues, as a standard procedure in 
organism identification among various disciplines of modern 
biology (Tautz et al. 2003, Shokralla et al. 2014, Stockinger et 
al. 2014, Stoeckle & Thaler 2014). 


The milestone paper on amplification and direct sequencing of 
fungal ribosomal RNA genes for phylogenetics by White et al. 
(1990, now > 15 000 citations), described primers that allowed 
simple, rapid PCR-based amplification and Sanger sequencing 
of sections of the fungal rDNA operon. Inspired by the efforts 
to resolve bacterial taxonomy using 16S sequences summa- 
rised by Woese (1987) and further elaborated by Weisburg et 
al. (1991) and Stackebrandt & Goebel (1994), the White et al. 
paper generated an explosion of phylogenetic research that 
now dominates fungal taxonomy. The modern concept of fungal 
DNA barcoding does not differ substantially from approaches 
proposed more than two decades ago. Sanger DNA sequencing 
of nuclear rDNA domains remains the most widely accepted 
approach in molecular mycology to classify and to identify un- 
known fungal specimens or cultures. The first comprehensive 
database of fungal DNA barcodes was established for yeasts 
(Kurtzman & Robnett 1998, Fell et al. 2000, Scorzetti et al. 2002) 
and revolutionised the approach to species recognition in that 
group. Where once a single taxonomist could identify only a 
handful of yeast strains in a week because of the plethora 
of physiological tests required, now it is possible to process 
hundreds. The result has been a rapid increase in the number 
of recognised yeast species, and a wealth of ecological data 
(Kurtzman et al. 2015). 


The broad acceptance of DNA barcoding today relies on public 
repositories such as the INSDC (http://www.insdc.org/), which 
accessions hundreds of thousands of sequence entries (Schoch 
et al. 2012, 2014). In a concerted action with the Consortium 
for the Barcode Of Life (CBOL), the Fungal Barcoding Consor- 
tium (Schoch et al. 2012) ratified the ITS as the universal DNA 
barcode for the fungal kingdom using the same gene section 
proposed by White et al. (1990) more than 20 years earlier. 
This study (Schoch et al. 2012) was widely accepted by the 
community and already has been highly cited > 850 times. 


Abandoning dual nomenclature of pleomorphic fungi in favour of 
a single name system in the nomenclatural code was a radical 
change (Gams & Jaklitsch 2011, Taylor 2011), partly enabled by 
the possibility to unequivocally demonstrate phylogenetic rela- 
tionships between asexual and sexual states that were formerly 
classified and named separately in parallel systems under the 


provisions of the former Article 59 (Guadet et al. 1989, Bruns 
et al. 1991, Berbee & Taylor 1992, Reynolds & Taylor 1993). 
This change to the botanical code was driven by DNA data, 
ITS and other rDNA sequences in particular, and demonstrated 
the impact of DNAsequencing on our understanding of genetic 
diversity, species identification, delimitation and nomenclatural 
changes in mycology. 


Despite the strength and impact of rDNA ITS as the sanctioned 
universal fungal DNA barcode, its resolution of higher taxonomic 
level relationships is inferior to many protein-coding genes such 
as RPB1, RPB2 or TUB2 (Nilsson et al. 2006, Seifert 2009, Be- 
gerow et al. 2010, Schoch et al. 2014). However, as a barcode 
ITS outperforms alternative loci because of its highly robust 
PCR amplification fidelity (> 90 % success rates), a Probability 
of Correct Identification (PCI) of about 70 %, and applicability 
to a wide range of sample conditions. Although many alterna- 
tives to ITS were considered by the mycological community, its 
sanctioning as the primary barcode marker rested on its practi- 
cality and reliability, not on the highly desired ‘resolution power’ 
(Schoch et al. 2012, 2014). Many mycologists would prefer one 
or several universal, but phylogenetically informative loci as 
barcodes, with higher species resolution power than is feasible 
with ITS. The ideal genetic marker would have high inter- and 
low intra-species sequence divergence, i.e. a discrete barcode 
gap (Schoch et al. 2012, Samerpitak et al. 2015), and accurately 
reflect higher-level taxonomic affiliations. It could then serve as 
a substitute for ITS, or as a supplementary, secondary or tertiary 
marker in concert with ITS as the primary barcode. 


Molecular taxonomists thus continue to search for genes 
conserved enough to allow reliable priming but sufficiently 
variable to yield highly resolved and well-supported phylo- 
grams and useful barcode gaps (Schmitt et al. 2009, Feau et 
al. 2011, Lewis et al. 2011, Robert et al. 2011, Walker et al. 
2012, Capella-Gutierrez et al. 2014). Although the concepts 
of phylogenetic markers and DNA barcodes differ in principle, 
they overlap in application. Their ideal characteristics include 
adequate species resolution, ease of amplification, absence of 
extreme length variation, the presence of only single copies, 
and low intra-species variability. Numerous attempts were made 
to identify loci with suitable primary barcode characteristics. 
These include efforts targeting: i) Cox? (or CO1; Seifert et al. 
2007, Dentinger et al. 2011, Robideau et al. 2011), the primary 
barcode for animals ratified by CBOL (Hebert et al. 2003a, b, 
Schindel & Miller 2005); ii) the AFTOL (http://aftol.org/about. 
php) genes (e.g. RPB1, RPB2, nucLSU, nucSSU, mtSSU, 
TEF1a and mtATP6), partially used by James et al. (2006) 
and evaluated for their barcoding potential by Schoch et al. 
(2012); iii) non-universal regions such as ND6 (hypothetical 
protein), CAL (Calmodulin), ACT (Gamma Actin) or TUB2 (Beta 
Tubulin 2) (Carbone & Kohn 1999, Aveskamp et al. 2009, Lee 
& Young 2009, Verkley et al. 2014); iv) the minichromosome 
maintenance complex MCM7 (a DNA helicase) and Tsr1 (a 
pre-mRNA processing protein homolog), the first loci extracted 
from genome-based computational predictions (Aguileta et al. 
2008, Schmitt et al. 2009) and FG1093 and MS204, selected 
from a screen of 25 single copy protein coding genes (Walker 
et al. 2012). 
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The relatively small number of barcode markers now available 
strongly reflect the past ‘poor man’s approach’ of data-driven 
gene selection (Eberhardt 2010, Capella-Gutierrez et al. 2014). 
As emphasised by Eberhardt (2010), group or clade-specific 
questions continue to require different, more specialised spe- 
cies identification solutions (Balajee et al. 2009, Lumbsch & 
Leavitt 2011, Gao & Zhang 2013, Heinrichs et al. 2012). This 
persists because attempts to identify alternatives or substi- 
tutes for rDNA sequences that meet the requirements of a 
single ‘primary barcode’ have not yet succeeded. Discovery 
of the ‘golden bullet barcode’ seems more unlikely than ever. 
Inconsistencies between the results of different computational 
analysis and their identified ‘best’ genes (Tautz et al. 2003, 
Avise 2004, Feau et al. 2011, Lewis et al. 2011, Robert et al. 
2011, Capella-Gutierrez et al. 2014) remain discouraging. Most 
mycologists agree that a compromise solution of combining ITS 
with secondary or tertiary, ‘group-specific’ DNA barcode(s) is the 
most realistic solution. This would combine the universal primer 
fidelity and high taxon coverage possible with ITS, with equally 
robust primers for secondary barcodes specific to the group or 
taxon of interest, enhancing precision of species identification. 
To take an analogy from medical diagnostics, this is an itera- 
tive diagnostic process, where the results of a primary analysis 
(ITS sequencing) can be used to determine what secondary 
analyses should be performed (Irinyi et al. 2015a, b). 


In adopting additional barcodes, the obvious absence of com- 
plete reference data is a serious problem, and was one of the 
main arguments presented against considering Cox7 a primary 
barcode for fungi (other problems eventually emerged, see e.g. 
Gilmore et al. 2009). To be truly effective for specialised and 
non-specialised applications, such as phytopathology (plant 
diseases control and quarantine), clinical applications (diagnos- 
tics, disease control, epidemiology) or environmental studies 
(ecology, conservation, metagenomics), formally adopted bar- 
codes will need to be highly predictive, and thus comprehensive 
reference datasets will need to be developed. Ideally, such data 
should leave almost no margin for error, reduce false-positives 
even in the absence of a perfect match, and allow for additive 
compensation via the primary barcode (ITS) when unknowns 
are likely to occur and reference data for rare taxa is scarce, 
as is the case in fungal ecology (Tedersoo et al. 2008, Buée et 
al. 2009, Pawlowski et al. 2012). 


The selection process for novel high fidelity ‘phylogenetic mark- 
ers’ and ‘DNA barcodes’ is mostly determined by laboratory 
practicalities. /n silico analyses of complete genomes (Capella- 
Gutierrez et al. 2014) to identify combinations of genes most 
informative to establish phylogenetic relationships are concep- 
tually similar. Current marker selection procedures are biased 
towards ranking single genes, rather than combinations of 
genes, reducing resolution power. The question arises which 
approach is optimal for identifying potential secondary or tertiary 
barcodes and how selection procedures, formerly achieved by 
‘trial-and-error’, can be optimised to avoid arbitrary selections? 


At present, the growth of the number of complete fungal genome 
sequences and the steady increase of comparative genomic 
tools allow an unprecedented view on variability among genes 
and species, on sequence homology and gene synteny. The 
detection of orthologs and paralogs is crucial for inference 
of robust molecular classifications. Understanding should 
go beyond phylogenetics and determination of species limits 
(Grigoriev et al. 2014, Nagy et al. 2014). 


A key concept in the present study was the in silico selection 
process for alternative candidate barcodes, following strate- 
gies described by Robert et al. (2011) and Lewis et al. (2011) 
to search for gene regions in complete fungal genomes under 
several optimality criteria. Lewis et al. (2011) inferred suitable 


candidate genes using translated protein sequences (Pfam do- 
mains), whereas Robert et al. (2011) focused entirely on nucleic 
acids. Both studies identified gene sections and identified novel 
primer pairs that functioned in silico for the genomes available 
at the time, either as universal fungal primers, or as primers 
targeting phyla or classes. In our study, the primers were rigor- 
ously tested and later distributed for independent verification to 
contributing laboratories. Departing from the study of Robert et 
al. (2011), we tested and compared amplification efficiency of 
two nuclear ribosomal regions (ITS and LSU, D1—D2 domains 
of 26/288), the 5’ primed end of B-tubulin2 (TUB2) and y-actin 
(ACT), the section ‘6-7’ of the second largest subunit of the 
RNA-polymerase II gene (RPB2), the commonly used interme- 
diate section of translation elongation factor 1-a (TEF 1a) (Sasi- 
kumar et al. 2012) corresponding to the section 983—1567 bp 
(in the rust Puccinia graminis), two novel universal high fidelity 
TEF1a primer alternatives covering approximately the same 
region, three sections of the newly identified candidate region 
and fungus-specific gene, translation elongation factor 3 (TEF3) 
(Ypma-Wong et al. 1992, Belfield et al. 1995, Belfield & Tuite 
2006, Greganova et al. 2011), and one of the small ribosomal 
proteins required for tRNA transfer, the 60S L10 (L1) (Brodersen 
& Nissen 2005). From the study of Lewis et al. (2011) based 
on Pfam domains, we tested ITS against the seven genes that 
could theoretically amplify a wide range of fungal taxa with a 
single primer pair, namely, phosphoglycerate kinase (PGK, 
PF00162), DNA topoisomerase | (TOPI, PF02919), Lipin/ 
Ned1/Smp2 (LNS2, PF08235), Indole-3-glycerol phosphate 
synthase (/IGPS, PF00218), Phosphoribosylaminoimidazole 
carboxylase PurE domain (PurE, PF00731), Peptide methionine 
sulphoxide reductase (Msr, PF01625) and Vacuolar ATPase 
(VATP, PF01992). Overall, universal primer performance was 
recorded for almost 20 000 individual PCR reactions, and se- 
lected sequence data were evaluated using a ‘distance’ — rather 
than a ‘character’ — matrix approach to accommodate a puristic 
barcoding concept, since the ‘barcoding-gap’ is traditionally 
calculated from Kimura-2-parametric distance model (K2P) 
(Kimura 1980). Our taxon selection covers several major line- 
ages of the fungal kingdom and represents hundreds of species 
of economic, phytopathological and clinical importance among 
the Agaricomycotina, Pezizomycotina, Pucciniomycotina, Sac- 
charomycotina and Ustilaginomycotina. 


MATERIAL AND METHODS 


Fungal cultures and specimens 


For the barcode primer search by the complete genome ap- 
proach of Robert et al. (2011), axenic cultures preserved and 
deposited in several biological resource centres, private col- 
lections and fungaria were used as sources for DNA. Tested 
fungal cultures are circumscribed according to their higher level 
taxonomic affiliation, main taxa (species level), quantity per 
‘dataset’ source and corresponding collaborator in Table 1. A 
detailed taxon list for each ‘dataset’ is available upon request. 
Cultures were either activated from lyophilised or cryopreserved 
material and inoculated on various media, such as oatmeal (OA) 
and 2 % malt extract (MEA, Oxoid) agars, prepared accord- 
ing to Crous et al. (2009). Alternatively, preserved cultures were 
directly used for DNA isolation from lyophilised or cryopreserved 
stocks. The selected ‘datasets’ covered several important taxa 
within the Agaricomycotina, Pezizomycotina, Pucciniomycotina, 
Saccharomycotina and Ustilaginomycotina. In total 1 931 fungal 
strains and fungarium specimens representing > 1 500 fungal 
species (exact taxonomic status for some strains was not pre- 
cisely determined ‘spp.’) were selected for benchmarking newly 
designed against published primers. For the barcode primer 
search by the Pfam domain approach of Lewis et al. (2011), 
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Table 1 Overview on datasets and selected taxa. 


Source, Collaborator 


Quantity 


Main taxa 


Family 


Dataset 


CBS, Houbraken 
CBS, Yilmaz 


96 


® 
© 
D 
O 
© 
S 
S 
Q 
is) 
Š 
me) 
E 


96 


Trichocomaceae 
Microascaceae 


Penicillium spp., Talaromyces spp. 
Scedosporium apiospermum 


Scedosporium spp. 


Inst. Pasteur, Hermoso 


30 
22 


180 
207 


Scedosporium | 


Westmead Hospital, Meyer & Irinyi 


Microascaceae 


Scedosporium Il 
Onygenales 


CBS, Dukik, Moreno, De Hoog, Stielow 
Westmead Hospital, Meyer & Irinyi 
CBS, Groenewald, Boekhout 


Trichophyton, Epidermophyton, Arthroderma, Microsporum, Chrysosporium spp. 


Arthrodermataceae and Ajellomycetaceae 


Saccharomycetaceae 


Candida albicans, C. catenulata, C. dubliniensis, C. glabrata, Kodamaea spp., Clavispora spp. 


Candida albicans, C. tropicalis, C. orthopsilosis, C. metapsilosis 


Debaromyces, Millerozyma spp. 


Ascomycetous yeast | 


96 
24 


Saccharomycetaceae 


Ascomycetous yeast II 


CIRM-Levures, INRA, Casaregola, Mallet & Jacques 


Microbiology Perugia, Cardinali & Roscini 


Saccharomycetaceae 


Ascomycetous yeast III 


Debaromyces, Millerozyma, Saccharomyces spp. 


Colletotrichum spp. 


Saccharomycetaceae 


Ascomycetous yeast IV 
Colletotrichum 
Hypocreales 

Fusarium 


Senkenberg Museum Goerlitz, Damm 


CBS, Lombard 


50 


Glomerellaceae 
Nectriaceae 


96 


Cylindrocarpon, Fusarium, Nectria spp. 
Fusarium fujikuroi sensu stricto 


Ochroconis spp. 


CBS, Al-Hatmi, Van Diepeningen 


CBS, Samerpitak 


96 


Nectriaceae 


50 


Sympoventuriaceae 


Ochroconis 


FABI, Wingfield, De Beer, Barnes & Duong 


CBS, Verkley & Stielow 
CBS, Quaedvlieg 


21 


Ceratocystis spp. 


Ceratocystidaceae 
Montagnulaceae 


Ceratocystis 


96 


Coniothyrium spp. 


Coniothyrium 


45 
90 


Mycosphaerella, Septoria spp. 


Mycosphaerellaceae 


Mycosphaerellaceae 


CBS, Egidi, Binder & Quaedvlieg 


Teratosphaeria spp. sensu lato 
Polyporus, Trametes spp. 


Russula spp. 


Teratosphaeriaceae 
Polyporaceae 


Teratosphaeriaceae 
Polyporaceae 
Russula 


INRA, Chaduli, Lomascolo, Welti & Lesage-Meessen 
Museum fiir Naturkunde Stuttgart, Eberhardt 

Royal Botanic Garden Madrid, Martin 

Westmead Hospital, Meyer & Irinyi 


20 
47 


Russulaceae 


Pisolithus, Phallus and Astraeus spp. 
Cryptococcus, Rhodotorula spp. 


Sclerodermataceae, Phallaceae and Diplocystaceae 


Gasteromycetes 


Tremellaceae 


Basidiomycetous yeast | 


DSMZ & Ruhr University Bochum, Yrukov & Begerow 


174 
182 


Cryptococcus, Kwoniella, Bullera spp. 


Tremellaceae sensu lato 


Basidiomycetous yeast II 


Multi-Phyla 


Agriculture and Agri-Food Canada (AAFC), Ottawa 


Rhizoclosmatium, Mortierella spp., Fusarium spp., Drechslera spp., Cadophora, 


Chytridiaceae, Mortierellaceae, Nectriaceae, 
Pleosporaceae, Pucciniaceae, Rhizophydiaceae, 


Pyrenophora spp., Ulocladium, Epicoccum, Dendryphion, Alternaria, Puccinia spp., 


Rhizophydium spp., Tilletia spp., Wallemia spp. and 40 more 


Tilletiaceae, Wallemiaceae and 35 more 


fungal cultures were obtained from a number of laboratories 
at the Eastern Cereal and Oilseed Research Centre, ECORC, 
of Agriculture and Agri-Food Canada, CEF, Ottawa. They were 
supplied as fungal cultures from which DNA was extracted or 
as already purified DNA samples. 


Standard molecular procedures 


For the primers inferred from the complete genome through 
the approach by Robert et al. (2011), total genomic DNA was 
extracted from living cultures, cells preserved in liquid nitro- 
gen or in lyophilisation and from dried fungarium specimens 
(Agaricomycotina only) using different variations of one-by-one 
or high-throughput 96-well plate DNA extraction techniques, 
routinely used by the respective collaborating laboratories 
(Ferrer et al. 2001, Ivanova et al. 2006, Yurkov et al. 2012, 
2015, Feng et al. 2013, Verkley et al. 2014). DNA extraction 
and PCR protocols differed between participating laboratories. 


Primers and amplification conditions used varied between labo- 
ratories, an example of the ones used at the CBS is provided. 
PCR reactions for amplification of the ITS barcode employed 
primers ITS5/ITS1/ITS1F and ITS4 were performed under 
standard or semi-nested conditions in 12.5 uL reactions (the 
CBS-KNAW barcoding lab protocol) containing 2.5 uL purified 
DNA, 1.25 uL PCR buffer (Takara, Japan, incl. 2.5 mM MgCl,), 
1 uL dNTPs (1 mM stock; Takara, Japan), 0.6 uL v/v DMSO 
(Sigma, Netherlands), forward-reverse primer 0.25 uL each 
(10 mM stock), 0.06 uL (5 U) Takara HS Taq polymerase, 7.19 
uL MilliQ water (White et al. 1990, Stielow et al. 2012, Yurkov 
et al. 2012). PCR conditions for amplifying partial LSU rDNA 
using primers LROR and LR5 differed only by their annealing 
temperature (55 °C instead of 60 °C) and increased cycle ex- 
tension time (90 s per cycle). Amplification of partial y-actin 
(ACT), covering the more variable 5’-end including two small 
introns, and partial B-tubulin 2 (TUB2), covering the variable 
5’-end with up to four small introns, followed the protocol of 
Aveskamp et al. (2009) and Carbone & Kohn (1999) using the 
primers Btub2Fd and Btub4Rd, and ACT-512F, ACT-783R, 
respectively. TEF1a and RPB2 were amplified following the 
protocols of Rehner & Buckley (2005) and Liu et al. (1999), 
respectively (Table 2). PCR products were directly purified us- 
ing FastAP thermosensitive alkaline phosphatase and shrimp 
alkaline phosphatase (Fermentas, Thermo Fisher Scientific). 
Cycle-sequencing reactions were set up using ABI BigDye Ter- 
minator v. 3.1 Cycle Sequencing kit (Thermo Fisher Scientific), 
with the manufacturers’ protocol modified by using a quarter of 
the recommended volumes, followed by bidirectional sequenc- 
ing with a 3730xI DNA Analyser (Thermo Fisher Scientific). 
Sequences were archived, bidirectional reads assembled and 
manually corrected for sequencing artefacts using BioloMICS 
software v. 8.0 (www.bio-aware.com) (Vu et al. 2012). Edited 
sequences were exported to and aligned with MAFFT v. 7.0 
(Katoh et al. 2005) and further corrected for indels and SNPs 
(single nucleotide polymorphisms) by replacing respective 
positions with ambiguity code letters. 


For the primers inferred from the complete proteome through 
the Pfam approach by Lewis et al. (2011), genomic DNA from 
fungal cultures was extracted using the OmniPrep™ Genomic 
DNA Extraction Kit (G-Biosciences, St. Louis, Missouri). Fungal 
tissue was ground into a fine powder in a mortar and pestle 
with liquid nitrogen and stored at -80 °C. Approximately 50 mg 
of ground tissue was placed in a 1.7 mL microfuge tube, resus- 
pended in 250 uL of lysis buffer and vortexed for several sec- 
onds. An additional 250 uL of lysis buffer was added to the re- 
suspension and the tube was incubated for 15 min at 55—60 °C 
without the addition of Proteinase K. The samples were cooled 
to room temperature and 200 uL of chloroform was added to 
the tube. The tube was mixed by inversion several times and 
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Table 2 Benchmarking primers used to assess performance and versatility of newly designed primers. 


Locus Primer Oligo nucleotides (5’-3’) Reference 
ITS ITS5 GGAAGTAAAAGTCGTAACAAGG Ward & Adams (1998) 
ITS4 TCCTCCGCTTATTGATATGC White et al. (1990) 
ITS1 TCC GTA GGT GAA CCT GCG G White et al. (1990) 
ITS1-F CTT GGT CAT TTA GAG GAA GTAA Gardes & Bruns (1993) 
LSU LROR ACCCGCTGAACTTAAGC Vilgalys & Hester (1990) 
LR5 TCCTGAGGGAAACTTCG 
TUB2 Btub2Fd GTBCACCTYCARACCGGYCARTG Woudenberg et al. (2009) 
Btub4Rd CCRGAYTGRCCRAARACRAAGTTGTC 
Bsens ATCACWCACTCICTIGGTGGTGG Lesage-Meessen et al. (2011) 
Brev CATGAAGAARTGIAGACGIGGG 
ACT ACT512f ATGTGCAAGGCCGGTTTCG Carbone & Kohn (1999) 
ACT783r TACGAGTCCTTCTGGCCCAT 
CA5R GTGAACAATGGATGGACCAGATTCGTCG Daniel et al. (2001), Daniel & Meyer (2003) 
CA14 AACTGGGATGACATGGAGAAGATCTGGC 
RPB2 fRPB2-5f GAY GAY MGW GAT CAY TTY GG Liu et al. (1999), Binder & Hibbett (Clark University website) 
fRPB2-7cF ATG GGY AAR CAA GCY ATG GG 
fRPB2-7cR CCC ATR GCT TGY TTR CCC AT 
TEF1a EF1-983F GCY CCY GGH CAY CGT GAY TTY AT Rehner & Buckley (2005) 
EF1-1567R ACH GTR CCR ATA CCA CCR ATC TT 


then centrifuged for 10 min at 14 000 xg. The upper aqueous 
phase was removed to a new microfuge tube and 100 uL of 
precipitation solution was added. If no white precipitate was 
produced an additional 50 uL of precipitation solution was added 
to the tube. The white precipitate was pelleted by centrifugation 
and the supernatant was moved to a fresh tube. The genomic 
DNA was precipitated by the addition of 500 uL of isopropanol 
to the supernatant and inversion of the tube several times. The 
genomic DNA was pelleted by centrifugation and washed with 
700 uL of 70 % ethanol. The ethanol was decanted and the 
pellet was air dried for 15 s prior to resuspension in 50 uL of TE 
Buffer. DNA concentration was determined using the Qubit® 
2.0 Fluorometer (Life Technologies, Burlington, Canada) and 
working solutions were prepared at a concentration of 0.05 ng/ 
uL. PCR was carried out on 0.1 ng of genomic DNA in a total 
volume of 20 uL using 0.2 mM dNTPs, 0.5 uM of each primer, 
1x of Titanium Taq Buffer and Titanium Taq DNA Polymerase 
(Clontech) using an Eppendorf GradientS thermal cycler. For 
the ITS primers, an initial denaturation at 95 °C for 3 min was 
followed by 40 cycles at the following conditions: 30 s at 95 °C, 
45 s at 58 °C and 2 min at 72 °C. A final extension at 72 °C for 
8 min completed the PCR. For the remaining primers, Touch- 
down PCR was performed where an initial denaturation at 
95 °C for 5 min was followed by 10 cycles of 45 s at 95 °C, 
45 s starting at 68 °C and dropping by 1 °C per cycle until a 
temperature of 58 °C was reached and a 1 min extension at 
72 °C. The initial 10 cycles were then followed by 35 cycles of 
45 s at 95 °C, 45 s at 58 °C and 1 min at 72 °C. A final exten- 
sion at 72 °C for 5 min completed the PCR. 


In silico selection of gene and protein regions and initial 
seed primers 


Two different genome-mining strategies were employed to 
identify potential new barcode markers and different strategies 
were used to develop them further and for validation. 


For the DNA-based approach, complete genome sequences 
of 74 fungi were downloaded from public repositories such as 
Broad, Genoscope and NCBI in 2011, covering most major 
lineages of the fungal kingdom. The initial approach to find ideal 
gene regions was described in Robert et al. (2011) as the ‘Ideal 
Locus Method’ (ILM). An ideal locus is a gene or gene region 
that would provide a phylogram possibly close to a ‘whole- 
genome phylogram’ (a phylogeny inferred from single copy 
orthologous genes). A ‘Best Pair of Primers Method’ (BPPM) 


was devised to identify short conserved sections at < 1 Kb apart 
that could serve as kernels to construct forward and reverse 
primers for species or selected taxonomic groups. Suitability 
as barcode candidate was assessed by calculating Pearson 
coefficients against the super matrix, built from overlapping 
orthologs (Kuramae et al. 2007). 


PCR primers are normally 18—24 bp long, but in practice only 
the 3’ end fit is critical for PCR success. Hence, an alternative 
‘Short Primer Method’ (SPM) was developed as follows: i) listing 
all possible 12 bp primers as ‘Length 12’ (maximum to limit cal- 
culations); ii) searching for these 12-mer primers in all available 
fungal genomes, which is considerably faster than searching 
for primer pairs; iii) maintaining primers with sufficient quality 
in terms of PCR applicability and wide distribution. Factors in 
the equation are: F1 = number of most frequent nucleotides / 
size of the sequence; F2 = comparison of the sequence with 
itself, with an offset of three nucleotides, this factor is high for 
repetitive triplets (e.g. ACCACCACC...n); F3 = comparison 
of the sequence with its reverse complement, with an offset 
greater than half the sequence size (avoidance of hairpins). All 
factors (Fn) take a value of 0.25 for a random sequence, and 
a value of 1.0 for a poor sequence. Thus, the factors F2 and 
F3 were rescaled with the following equation: F = (F — 0.25) / 
0.75, the final equation is given by: Quality = 1.0 — max (F1, F2, 
F3); iv) using these best primers to search for possible primer 
pairs; v) using the best primer pairs to extract the intervening 
DNA sequence. The target length of the intermediate sequence 
was set to 200 to 1 000 nucleotides, thresholds chosen to en- 
sure a minimum variability and allow ease of amplification and 
Sanger sequencing. The formal DNA standard recommends 
amplicons of about 700 bp, reflecting the technical constraints 
of the Sanger sequencing prevalent when the standard was 
implemented; vi) comparing the in silico sequences and build- 
ing distance matrices and trees to compare with a reference 
matrix and tree. This last step ensures that the selected DNA 
region produces a relatively coherent phylogeny compared 
to the reference matrix and that it would also be suitable for 
discriminating closely related species. 


The in silico design, execution and publication of the resulting 
primers based on identification of protein families (see Lewis et 
al. 2011) was done concurrently with the DNA-based method 
described above. These primers were extensively tested only 
at the Agriculture & Agri-Food Canada laboratory on smaller 
fungal test sets by Levesque, Seifert, Hambleton, Lewis and 
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Table 3 Initial seed primers from ‘nucleic acid based’ computational predictions (Robert et al. 2011). 


Alignment Species in Primers (forward-reverse) Functional annotation 

group alignment (Al) 
1 51 acaagcgtttct-catcaagttcca Hypothetical protein 
2 41 acatggagaaga-catcaaggagaa Gamma actin 
3 44 accttcttgatg-atgttcttgatg Translation elongation factor 1a 
3b 46 caagaacatgat-catcaagaaggt Translation elongation factor 1a 
4 38 agtacttgtagg-cttggccttgta 60S ribosomal protein L15b 
5 51 ggaacttgatgg-agaaacgcttgt 60S ribosomal protein L15b 
6 45 ggtatcaccatc-caacaagatgga Translation elongation factor 1a 
7 45 tcecatcttgttg-gatggtgatacc Translation elongation factor 1a 
8 36 gtccatcttgtt-tacttgaaggaa Translation elongation factor 1a 
9 56 gttcttggagtc-gtccatcttgtt Translation elongation factor 1a 
10 56 aacaagatggac-ctccaagaacga Translation elongation factor 1a 
11 41 aacaagatggac-tcaccactgaag Translation elongation factor 1a 
11b 42 tggtatctccaa-aacgtcaagaac Translation elongation factor 1a 
12 58 caacaagatgga-ctccaagaacga Translation elongation factor 1a 
13 42 cacttcttcatg-atggacgagatg 60S ribosomal protein L15b 
14 37 catatgcttgtc-gactcgtcatct Putative CD box sno-RNA protein,putative protein S6 kinase 
15 43 cttcagtggtga-ccatcttgttga Translation elongation factor 1a 
16 51 gaacttgatggt-agaaacgcttgt Hypothetical protein 
17 36 gttccttcaagt-caacaagatgga Hypothetical protein 
18 36 tecatcttgttg-acttgaaggaac Translation elongation factor 1a 
19 48 tcttgacgttga-gtccatcttgtt Translation elongation factor 1a 
20 48 ttcttgacgttg-gtccatcttgtt Translation elongation factor 1a 
21 39 tegttcttggag-tcttgatgaagt Translation elongation factor 1a 
21b 53 ttcttggagtca-tccatcttgttg Translation elongation factor 1a 
22 45 ttctigacgttg-catgttcttgat Translation elongation factor 1a 
23 37 ttccttcaagta-caacaagatgga Translation elongation factor 1a 
24 45 ttcatcaagaac-tcaacgtcaaga Translation elongation factor 1a 
25 41 ttcagtggtgac-atcatgttcttg Translation elongation factor 1a 
26 39 tecttgatttcg-cttcatgacctt Translation elongation factor 1a 
27 37 tcaagaaggtcg-ctccaagaacga Translation elongation factor 1a 
28 46 tcaagaacatga-caacgtcaagaa Translation elongation factor 1a 
29 40 caacaagatgga-aagttcatcaag Translation elongation factor 1a 
30 47 gttcttgacgtt-gtccatcttgtt Translation elongation factor 1a 
31 48 aacaagatggac-caacgtcaagaa Translation elongation factor 1a 
32 39 gacttgatgaac-ccatcttgttga Translation elongation factor 1a 
33 45 catcaagaacat-tcaacgtcaaga Translation elongation factor 1a 
34 52 catcaagaacat-gactccaagaac Translation elongation factor 1a 
35 40 catcaagaaggt-gactccaagaac Translation elongation factor 1a 
36 40 atcaagaacatg-tcaccactgaag Translation elongation factor 1a 
37 51 caagcgtttctc-ccatcaagttcc Translation elongation factor 1a 
38 39 atctccaaggat-ctccaagaacga Translation elongation factor 1a 
39 52 atcaagaacatg-gactccaagaac Translation elongation factor 1a 
40 45 atcaagaacatg-tcaacgtcaaga Translation elongation factor 1a 
41 40 acttgatgaact-tccatcttgttg Translation elongation factor 1a 
42 37 acttcatcaaga-tcaacgtcaaga Translation elongation factor 1a 
43 40 acaagatggaca-gactccaagaac Translation elongation factor 1a 
44 38 aacaagatggac-aagttcatcaag Translation elongation factor 1a 
45 39 aaggtcatgaag-cgaaatcaagga Translation elongation factor 2 
46 42 catctcgtccat-catgaagaagtg Beta tubulin 2 
47 45 gttcttgaactt-cttcatcttcca Translation elongation factor 3 
48 39 tecttgatttcg-ccatcttggaga Translation elongation factor 3 
49 45 tggaagatgaag-aagttcaagaac Translation elongation factor 3 
50 42 tggaagatgaag-tgtcaagaccaa Translation elongation factor 3 
51 42 ttggtcttgaca-cttcatcttcca Translation elongation factor 3 
52 51 tggaacttgatg-gagaaacgcttg 60S ribosomal protein L10 (L1) 
53 35 ggtatcaccatc-ctccaagaacga Translation elongation factor 1a 
54 40 ttcatcaagaac-tcaccactgaag Translation elongation factor 1a 


Seed primer sequences highlighted bold, were those qualifying for the last and final cross-laboratory trial. 


McCormick, with additional testing of primers done at the 
CBS-KNAW Fungal Biodiversity Centre. The in silico pipeline 
for identifying suitable protein sequences for primer design is 
briefly described below. 


The approach was inspired by the CARMA algorithm for taxo- 
nomic identification of metagenomic sequences (Krause et al. 
2008). CARMA matches short environmental gene fragments to 
Pfam protein families and constructs a taxonomic profile for the 
sample. The objective is to assign translated input nucleotide 
sequences to Pfam accessions that can identify single copy 
gene regions in source organisms and then design degener- 
ate primers from the alignment of these putatively orthologous 
sequences. This requires pre-processing the Pfam dataset, 


translating and assigning DNA sequences to Pfam groups, 
and adding sequences to the reference Pfam alignments. The 
CARMA pipeline was adapted for use in this project, although 
the original intent and our application are quite different. 


Data were processed in two stages. The first stage translates 
input sequences and assigns them to Pfam accessions; the 
second stage prepares alignments and attempts primer design 
for the selected set of taxa. Adding additional or updating data 
for existing organisms requires repetition of the first stage. Re- 
stricting the analysis to a taxonomic subset of the data requires 
that the final stage be repeated for the selected organisms. 
The pipeline requires up-to-date versions of the Pfam-A and 
NCBI taxonomy datasets; if the Pfam dataset is updated, the 
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first stage of processing must be repeated. The input datasets 
are placed ina single folder for processing, with each organism 
in a unique directory named by ‘genus_species_strain’. The 
pipeline first processes source nucleotide sequences to identify 
Pfam domains within each sequence and creates clusters of 
conserved domains. Sequences are then translated into protein 
sequences using the orientation and frame from a translated 
BLAST search against the Pfam sequences. Each translated 
sequence is then processed to identify the region correspond- 
ing to the matched protein family and the protein sequence is 
written to a file. This functionality is as originally implemented 
in the CARMA pipeline. The second stage of the pipeline in- 
volves screening the Pfam matches for a user-defined subset 
of organisms to identify families that contain a single match for 
each of the selected organisms. Such families are considered 
to represent conserved, single copy gene regions. Next, these 
putatively orthologous sequences are added to the correspond- 
ing Pfam multiple sequence alignment. Adding the sequences 
to an existing curated alignment results in a superior alignment 
to de novo sequence alignment. Original sequences are then 
removed from the reference alignments so that only targeted 
organisms remain and the resulting alignment is screened for 
conserved blocks for primer design. Conserved regions flank- 
ing variable sites are identified and only alignments with two 
or more conserved blocks are selected so that pairs of forward 
and reverse primers can be sought in the separate blocks. The 
resulting primers were further screened for protein families 
containing primer pairs with additional desirable characteristics, 
such as optimum amplicon length. 


Final primer design 


Two different genome and proteome mining strategies were 
employed to identify the initial potential new barcode markers 
(Lewis et al. 2011, Robert et al. 2011) and different strategies 
were also used to develop them further and for validation. 


The selected gene regions by the complete genome approach 
of Robert et al. (2011) corresponding to a total of 54 alignments 
(available on request) were manually re-annotated using n- and 
x-BLAST and alignments resulting in identical annotations were 
realigned and redundant alignments discarded. These seed 
primers are given in Table 3. To design PCR primers, non- 
redundant alignments were imported into BioEdit v. 7.0.52 and 
visually screened for the most conserved sites to identify suit- 
able primer binding sites. Priority was given to stable 3’ prime 
ends with at least eight nucleotide binding sites and the least 
possible amount of degeneracy. For several primers, ‘N’s were 
replaced with an inosine, ‘s, to improve binding stability and to 
reduce the quantity of primers in the actual synthesized wobble 
pool. Individual primer pairs are discussed in the results section. 


For the Pfam domain approach by Lewis et al. (2011), a nu- 
cleotide search was first conducted within NCBI for Interpro 
names to help with primer validation, using keywords found 
within the paper (e.g., PF01625, PGK, IPR001576) and NCBI 
was searched for more recent accessions using trimmed amino 
acid sequences for each of the inferred protein regions as 
queries. The downloaded nucleotide sequences were added to 
the original data from Lewis et al. (2011). AClustalW alignment 
was performed individually on the supplied trimmed sequences 
and NCBI downloaded sequences. For LNS2, the alignments 
were primarily from basidiomycete sequences as the objective 
was to design primers that would have a higher success for this 
group. The alignments were handled individually to ensure that 
all possible primers for each gene could be identified. Within 
each gene alignment, frequency tables were developed for con- 
served regions. All areas of the protein region that contained, 
at a minimum, 3 conserved nucleotides (> 85 %) at the 3’ end, 
were examined for the potential of becoming a primer site. In 


Table 4 Initial test cultures used for primary laboratory trial | for primers in- 
ferred from ‘nucleic acid based’ computational predictions (Robert et al. 2011). 


CBS number Taxon 

CBS 513.88 Aspergillus niger 

CBS 818.72 Aspergillus oryzae var. brunneus 
CBS 1954 Candida parapsilosis var. parapsilosis 
CBS 115846 Cryphonectria parasitica 

CBS 123668 Fusarium oxysporum f.sp. lycopersici 
CBS 445.79 Laccaria bicolor 

CBS 277.49 Mucor circinelloides f. lusitanicus 
FGSC 9596 Nectria haematococca 

CBS 708.71 Neurospora crassa 

FGSC 10004 Phycomyces blakesleeanus 

CBS 405.96 Schizophyllum commune 

CBS 142.95 Trichoderma atroviride 

CBS 109036 Trichophyton equinum 

CBS 127170 Verticillium dahliae 

CBS 115943 Zymoseptoria tritici 


Table 5 Initial test cultures used for secondary laboratory trial II for prim- 
ers inferred from ‘nucleic acid based’ computational predictions (Robert et 
al. 2011). 


CBS number Taxon 


CBS 674.68 Ajellomyces dermatitidis 


CBS 118699 Alternaria brassicicola 

CBS 131.61 Aspergillus flavus var. flavus 
CBS 126972 Aspergillus nidulans 

FGSC 1144 Aspergillus niger 

FGSC 1156 Aspergillus terreus 

CBS 136.29 Bipolaris maydis 

CBS 114389 Blastomyces dermatitidis 

CBS 8758 Candida albicans var. albicans 
CBS 113850 Coccidioides immitis 

CBS 113843 Coccidioides posadasii 

CBS 126970 Coprinus cinereus 

CBS 8710 Filobasidiella neoformans 
FGSC 9075 Fusarium graminarum 

FGSC 9935 Fusarium oxysporum f.sp. lycopersici 
CBS 123670 Fusarium verticillioides 

FGSC 1089 Gibberella fujikuroi 

CBS 287.54 Histoplasma capsulatum 

CBS 2605 Lodderomyces elongisporus 
CBS 113480 Microsporum canis 

CBS 180.27 Neurospora tetrasperma 

CBS 223.38 Neurospora tetrasperma 

CBS 101191 Neurospora tetrasperma 

CBS 372.73 Paracoccidioides brasiliensis 
CBS 127171 Parastagonospora nodorum 
FGSC 9002 Phanerochaete chrysosporium 
CBS 6054 Pichia stipitis 

CBS 126969 Podospora pauciseta 

CBS 120258 Pseudocercospora fijiensis 
CBS 117146 Pyrenophora tritici-repentis 
CBS 128304 Pyricularia grisea 

CBS 658.66 Pyricularia grisea 

CBS 126971 Rhizopus orysae 

CBS 124811 Schizophyllum commune 

CBS 7116 Schizosaccharomyces japonicus 
CBS 484 Sporidiobolus pararoseus 
CBS 208.27 Sporobolomyces roseus 

CBS 375.48 Talaromyces Stipitatus 

CBS 693.94 Trichoderma atroviride 

CBS 392.92 Trichoderma reesei 

CBS 383.78 Trichoderma reesei 

CBS 127.97 Trichophyton equinum var. equinum 
CBS 668.78 Uncinocarpus reesii 

CBS 127172 Ustilago maydis 

CBS 127169 Verticillium alboatrum 

CBS 599 Yarrowia lipolytica 

CBS 732 Zygosaccharomyces rouxii 


developing the primers, all base positions were examined visu- 
ally. Any base with a frequency > 80 % was maintained in the 
primer. Primers in which there was no base with a frequency 
> 80 %, were resolved into degenerate primers, such that the 
minimum frequency > 80 % was maintained. Close to 500 
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primer pair combinations were tested. The overall number of 
forward primers designed, of reverse primers designed, and 
number of primer pairs tested, were 26, 23 and 125 for PGK; 
16, 13 and 74 for TOP7; 20, 28 and 49 for LNS2; 24, 13 and 
62 for IGPS; 17, 11 and 64 for PurE; 12, 9 and 27 for Msr, and 
12, 8 and 32 for vATP, respectively. 


Laboratory trials of gene-based primers (phases | and II) 


For the primers designed by the complete genome approach 
from Robert et al. (2011), annealing temperatures were set 
at 42 °C and slowly ramped up to 52 °C. This allowed us to 
identify the most universally applicable primer pairs (one-by- 
one proof-of-principle) and those yielding nothing, or single or 
multiple fragments. The general PCR cycler setup was: 7 min 
95 °C, 1 min 95 °C, 1 min 42—52 °C (+ 0.5 °C/cycle until 52 °C 
is reached), 2 min 72 °C, the latter three steps repeated 40 
times, final elongation at 10 min 72 °C and cooling at 10 °C. The 
laboratory work was conducted in two initial phases. Phase | 
included a small set of genomic DNA extracts roughly covering 
several major fungal lineages (Table 4) representing strains 
with completely sequenced genomes, thus the same strains 
used for the in silico work described above (Robert et al. 2011). 
This trial aimed to identify well-performing primer sets and to 
assess their versatility, i.e. their ability to amplify the targeted 
gene from various genera. Primer pairs that yielded no product 
were discarded. Phase II was conducted with 48 DNA extracts 
(Table 5) to decrease taxon bias and increase the reliability 
and consistency of the entire experiment. Only primers that 
successfully amplified, either highly accurately and/or slightly 
inaccurately (defined as a single PCR product not exactly the 
calculated size), were further tested employing the PCR condi- 
tions described above. Amplifications that resulted in a single, 
clear fragment, without any further reaction optimisation, were 


tested under widely used successful PCR conditions, defined 
as: 5 min 95 °C, 1 min 95 °C, 1 min 48 or 50 °C, 2 min 72 °C, 
the latter three steps repeated 40 times, final elongation at 10 
min 72 °C and cooling at 10 °C. This PCR protocol is generally 
applicable to those primers listed in Table 6. Only those pairs 
yielding single fragments under these conditions were kept for 
the final phase. Phase III included in depth testing of the best 
candidate primers, described in the Results section, with large 
sets of DNA extracts covering multiple species representing 
economically important genera such as Penicillium sensu 
stricto (s.str.), Fusarium s.str., or higher level ranks covering 
several genera within orders such as the Onygenales (Table 1). 
The selected taxa (number of strains or specimens) for this 
phase represent those covered by the consortium of partici- 
pating laboratories. Benchmarking of the best newly designed 
candidate primers was conducted against commonly used and 
well-recognised primer pairs (Table 2). All reagents for phases 
| and Il were standardised with enzymes and dNTPs from 
Takara (Japan), oligonucleotides synthesized by Integrated 
DNA Technologies (The Netherlands) and PCR reactions ran 
on SensoQuest PCR cyclers (Germany) as described above 
under ‘standard molecular procedures’. Reagents for phase III 
varied among laboratories and institutions according to their 
internal protocols, and served as a robust verification of the 
tested primers (detailed protocols can be requested from the 
respective ‘collaborator’ as indicated in Table 1). 


For the primers designed from Pfam domains (Lewis et al. 
2011), initial PCR analysis of potential barcoding primers was 
performed on DNA from the ascomycetes Cadophora fas- 
tigiata and Pyrenophora teres f. teres using a standard PCR 
protocol and gradient annealing temperatures, ranging from 
56-70 °C, to determine an optimum annealing temperature for 
further testing. Touchdown PCR (Don et al. 1991) was used to 


Table 6 Super primers and best candidate primers inferred from ‘nucleic acid based’ computational predictions (Robert et al. 2011). 


Locus Original primer name Final primer name Sequence (5’-3’) 

TEF1a AI33_54 _73_F1_forward EF1-1018F GAYTTCATCAAGAACATGAT 
AI33_879_859_R2_reverse EF1-1620R GACGTTGAADCCRACRTTGTC 
Al_34_EF1_300_F1_forward EF1-1002F TTCATCAAGAACATGAT 
AL34_EF1_1050_R_Tail_reverse EF1-1688R GCTATCATCACAATGGACGTTCTTGGAG 
AI33_129_148_F2_forward AI33_alternative_f GARTTYGARGCYGGTATCTC 
Al28_EF1_400_f EF1_alterantive_3f TTYGARGCYGGTATCTC 
Al28_EF1_900_R EF1_alternative_3r GAVACRTTCTTGACGTTGAA 

60S L10 (L1) AlIGr52_412-433_f1 60S-908R CTTVAVYTGGAACTTGATGGT 
Algr52_1102_1084_R1 60S-506F GHGACAAGCGTTTCTCNGG 

TEF3 Al50+51_EF3_2900_f EF3-3185F TCYGGWGGHTGGAAGATGAAG 
Al50+51_EF3_3300_R EF3-3538R YTTGGTCTTGACACCNTC 
Al47_EF3_1650_forward EF3-3188F GGHGGHTGGAAGATGAAG 
A47_EF3_2451_R1_reverse EF3-3984R TCRTAVSWGTTCTTGAACTT 
Al49_ EF3_44 63_F1_forward EF3-3186F CYGGHGGHTGGAAGATGAAG 
Al49_ EF3_846_829_r1_reverse EF3-3984R2 TCRTAVSWGTTCTTGAAC 


Table 7 Super primers inferred from ‘protein-based’ computational predictions (Lewis et al. 2011). 


Locus Original primer name Final primer name Sequence (5’ to 3’) 

PGK PF00162.1120.M13-8-F PGK_480-F TGTAAAACGACGGCCAGTACGATATCCGAGTCGACTTCAAYGTCCC 
PF00162.2081.M13-8-R PGK_480-R CAGGAAACAGCTATGACTCGAAGACACCRGGGGGACCGTTCCA 
PF00162.1433.M13-8-F PGK_483-F TGTAAAACGACGGCCAGTACGATGAGAACYTGCGHTTCCACRYYGAGGAGGARGG 
PF00162.1793.M13-8-R PGK_483-R CAGGAAACAGCTATGACCTTCTTGAAGGTGAARGCCAT 
PF00162A.675.1-F PGK_511-F GTYGSTGCYYTGCCMACCATCAA 
PF00162A.1915.1-R PGK_511-R ATCTTGTCRGMRACCTTRGCACC 
PF00162.1127.1-F PGK_533-F GTYGAYTTCAAYGTYCC 
PF00162.2081.1-R PGK_533-R ACACCDGGDGGRCCGTTCCA 

TOP1 PF02919.2708.M13.1-F TOP1_501-F TGTAAAACGACGGCCAGTACGATACTGCCAAGGTTTTCCGTACHTACAACGC 
PF02919.3469.M13.8-R TOP1_501-R CAGGAAACAGCTATGACCCAGTCCTCGTCAACWGACTTRATRGCCCA 

LNS2 PF08235.1463.8-F LNS2_468-F GGCCATGTGCTGAACATGATCGGHCGWGAYTGGAC 
PF08235.1821.8-R LNS2_468-R CGGTTGCCRAAKCCRGCATAGAAKGG 


Bases in bold: M13 primers with an ACGAT spacer for the forward primers. 


250 


Persoonia — Volume 35, 2015 


increase primer specificity. The final Touchdown temperatures 
of 68—58 °C was selected on the basis of the PCR giving the 
best combination of product concentration and single fragment 
size produced for two of the three initial ascomycetes tested. 
For /GPS and vATP, none of the primer pairs passed this first 
panel. Primer pairs that resulted in acceptable sequences for 
this first panel were carried forward and tested against a panel 
of eight organisms: Rhizophydium littoreum (Chytridiomycetes), 
Mortierella vinacea (Zygomycetes), the basidiomycetes Til- 
letia indica and Puccinia graminis, and the ascomycetes Pyr- 
enophora teres f. teres, Debaryomyces hansenii, Penicillium 
verrucosum and Fusarium graminearum. Only primers for PGK, 
TOP1 and LNS2 were tested further with additional isolates and 
with tagging of M13 primers to improve sequencing success 
(Table 7). Throughout the process of primer development, newly 
acquired long sequences, as well as newly mined results of full 
length genes from GenBank (performed on a monthly basis) 
were used to improved primers for shorter fragments. 


Sequence and data analysis (phase Ill) 


To assure standardised laboratory procedures, a working 
agreement defining experimental conditions was distributed 
prior to primer testing in phase III, which defined the basis for 
recording amplification efficiency. To assemble a data matrix of 
binary variables, successful amplification was scored as ‘true’ 
and unsuccessful amplification as ‘false’. Recovery of multiple 
fragments was also classified as ‘false’. Single fragments of 
unexpected sizes were assumed to represent length variations 
(mostly false negatives), a common phenomenon when com- 
paring intron-containing sequences, were classified as ‘true’. 
This phenomenon does not always reciprocally correlate with 
an incorrect amplification of targeted gene section or, subse- 
quently, poor downstream (sequencing) success. 


All data were analysed and visualised with R statistical software 
(R Core Team 2014) with the ‘lattice’, ‘vcd’ and ‘MASS libraries, 
as well as related packages; the source code is available from 
Sarkar (2008). Primer amplification efficiencies were visual- 
ised as ‘barchart’ and ‘mosaicplot’. Gene maps with primer 
locations were created with the Qiagen CLC genomics suite 
(http://www.clcbio.com/products/clc-genomics-workbench/) 
primers designed by the complete genome approach by Robert 
et al. (2011) and with Geneious v. R6 (Biomatters http://www. 
geneious.com/) for the primers designed from Pfam domains 
(Lewis et al. 2011). 


For sequences generated from primers designed by the com- 
plete genome approach by Robert et al. (2011), sequences were 
stored as bidirectional reads and edited as described above, 
using BioloMICS (Vu et al. 2012) at CBS. Various software 
packages, differing between laboratories, were employed to 
assemble consensus sequences. Sequences were visually 
corrected for sequencing artefacts and quality controlled within 
individually aligned datasets. Quality controlled data was up- 
loaded to the BioloMICS database and pairwise distances 
for selected (symmetric/orthogonal matrices only) datasets 
calculated using the optimistic reverse pairwise alignment 
algorithm. Symmetrical datasets were assembled to include 
sequence data for at least: complete ITS1-5.8S-ITS2, LSU, 
TEF1a (either from one of the newly designed TEF1a primer 
pairs, see Results), or the TEF1a AFTOL primer set EF 1-983F 
/ EF1-1567R, Table 2, spanning almost the same section), 
TEF3, and the section spanning the 60S ribosomal protein L10 
(L1). The global dataset comprised 502 strains or specimens 
representing a fully symmetrical distance matrix with 5 gene 
partitions. Each dataset was analysed individually to create a 
single locus similarity matrix to generate an overview UPGMA 
tree. A second analysis used all loci in a concatenated matrix, 
followed by a multi-dimensional scaling analysis (MDS), visual- 


ised in n-dimensional space. All six similarity matrices (5 loci + 
1 concatenated) were compared, to obtain pairwise cophenetic 
coefficients of correlation (Mantel test) between each similarity 
matrix (Smouse et al. 1986). A correlation super matrix was 
subsequently created, which was numerically rescaled and 
another final MDS performed; the results were plotted in an 
n-dimensional space to compare the ‘global’ between-loci 
performance. Eigenvalues for each axis were computed to 
assign weights to dimensions. Two examples, the Onygenales 
and Fusarium datasets were analysed individually as described 
above, with identical loci sampling, except for the Onygenales 
which was supplemented with ‘y-Actin’ sequence data to provide 
examples on ‘local gene optima’. 


For sequences generated from primers designed by the Pfam 
approach of Lewis et al. (2011), bidirectional reads were edited 
manually using Geneious to generate consensus sequences. 
Sequences were aligned through MAFFT within Geneious 
and processed through R to compare intra- and inter-species 
pairwise distances (Schoch et al. 2014). TUB2 and TEF1 
sequences of Penicillium and Fusarium strains, respectively, 
that were already available were added to the dataset as 
comparison for species resolution. The aligned sequences of 
each marker were analysed individually with the ‘ape’ package 
(Paradis et al. 2004) for R (R Core Team 2014) to generate raw 
pairwise distances for each marker and each pair of strains. 
The values were separated as intra- or inter-species distances 
and the function ggplot2 (Wickham 2009) was used with R to 
generate a barcode gap plot for each gene by overlaying the 
distribution of the intra- and inter-species distances. 


RESULTS 


In silico selection of gene regions and manual design of 
initial seed-primers (gene-based primers) 


The distance super matrix, underlying the multiple alignments 
used to generate the reference topology, was correlated with 
all individual distance matrices using Pearson coefficient. One 
third of the genes (29.8 %) produced a phylogeny highly cor- 
related with the super matrix (Robert et al. 2011). Seventy per- 
cent of the gene matrices correlated at > 0.70 with the super 
matrix, and only 25 genes had no or a very low correlation. For 
the 531-gene matrix, maximum correlations were found for 
190 concatenated individual genes. The results from the com- 
putational inference indicated that a single or a very small num- 
ber of genes were sufficient to reflect the reference topology. 
Unfortunately, none of the alignment sections used later for the 
design of initial seed-primers scored high among candidates 
that reflected the ideal reference topology (see Robert et al. 
2011). However, key criteria for in silico selection of gene re- 
gions that would successfully qualify as secondary barcodes 
were universally met, including the conservation of primer sites 
among distant taxa (e.g. Ascomycota and Basidiomycota), 
with predicted amplicons no longer than 1 000 bp, a technical 
constraint for Sanger DNA sequencing and a requirement of 
the barcode standard. 


Qualification of gene regions was restricted to the 54 alignments 
that served for seeding novel primers and manual redesign, 
described above, and reflected a compromise between score 
against the super-matrix and PCR versatility (1 Kbp cut-off). 
However, the alignments are essentially the results of the in 
silico searches described as BPPM and SPM above, and non- 
redundant alignments were selected. Manually inspected align- 
ments (available upon request) and the identified conserved 
sites yielded a large number of primers (Table 3), which were 
tested in trial | and reduced to the best performing candidates 
(Table 6) for trial Il. A more rigorous test of primers success- 
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ful in trial II was undertaken in trial III (Table 6), using sets of 
extracts with increased species sampling of specific taxonomic 
groups. Because of time and resource constraints, some of the 
putative primers were not tested and should be tested in future 
studies, e.g. those corresponding to TEF2 (Table 3; Al45, Al48). 


Gene maps, indicating exact primer locations, and exon and 
intron boundaries for each gene, were designed for the best 
performing candidates (Table 6) and for some important ‘stand- 
ard’ markers (Table 2; the latter supplemented with positional 
information of commonly used primers). These gene maps re- 
present the fungal rDNA operon (Fig. 1), TUB2 (Fig. 2), ACT 
(Fig. 3), RPB2 (Fig. 4), TEF1 (Fig. 5), TEF3 (Fig. 6) and 60S 
L10 (L1) (Fig. 7). 


In silico design of initial seed-primers of protein regions 
and manual optimization (Pfam based primers) 


For the seven protein families found by Lewis et al. (2011) that 
could be potentially amplified with a single primer pair, a total of 
127 forward primers and 105 reverse primer pairs were tested, 
including some modified with M13 sequencing primers, for a 
total of 433 primer pairs always tested individually. /n silico 
analysis and manual refinement resulted in six primer pairs 
that passed all quality filters, which were subsequently used 
as described above (Table 7). Gene maps for TOP1, PGK and 
LNS2, indicating primer locations, exon and intron boundaries 
for each gene are shown in Fig. 8—10, respectively. 


Laboratory testing of in silico designed primer sets 
designed from gene alignments 


Laboratory tests were conducted for 71 sets of primers cor- 
responding to nine distinct regions (or 15 alignment groups) 
listed in Table 3 for their ability to consistently amplify a short 
standardised gene region resulting in a single PCR fragment. 


Trial | 


With trial |, we identified pairs of primers that resulted in a sin- 
gle PCR product employing a low annealing temperature (Ta), 
what required adjustment of PCR parameters and conditions. 
PCR tests with several primer pairs that contained an inosine 
(= I), instead of third-base wobbles, resulted in almost no visi- 
ble amplicons. We excluded all such primer pairs from further 
testing, including those that did not result in any detectable PCR 
products, and these are not mentioned further. 


Trial II 


For trial Il, the Ta was unchanged but the set of extracts was 
expanded to test a broader range of fungi, and a subset of seven 
primer pairs was selected that gave optimal results during trial I. 
These primer sets were tested for their reliability and consist- 
ently over a broad range of fungal species, with preference for 
those amplifying the desired marker as single fragment. Some 
primer sets yielded multiple fragments. Some of the best primer 
sets, derived from alignment groups 2 (ACT), 33 (TEF1a) and 
52 (L1) yielded a strong single signal with almost no second- 
ary fragments. Primers for alignment groups 46 (TUB2) and 49 
(TEF3) yielded multiple fragments, but had a strong primary 
signal. Of our Trial II candidates, the most problematic primers 
were those for a gene corresponding to a small RNA process- 
ing protein, a putative S6 kinase (alignment group 14), which 
resulted in multiple intense bands. 


Tailed primer design was a successful approach, resulting in 
PCR products from evolutionary diverse fungi for TEF 1a, a gene 
until now difficult to amplify on a universal basis with known 
standard primers. However, because of our stringent testing 
procedure, only alignment groups highlighted in bold (Table 3) 
qualified further for trial IIl, and these selections were strictly 
based on universal fidelity among all tested strains in trial II. 


Trial Il 


In the last experimental phase, primer testing was extended to 
species-specific DNA sets. These extracts were received from 
10 collaborating taxonomic experts at CBS (Table 1) in sets of 
96 extracts. This more extensive testing evaluated the capac- 
ity of sequences yielded by the selected primer pairs for their 
power to delimit closely related fungal species. The extracts 
represented relevant groups such as medically or phytopatho- 
logically relevant fungi, or were selected because reference 
datasets were available for comparison, or because the taxa 
represented complexes of species that were previously shown 
to be difficult to resolve. 


The performance of several primer pairs amplifying TEF3 (align- 
ment groups 47, 49, 50 and 51) and newly designed primers 
for TEF1a (alignment groups 17, 28, 31, 33) was consistently 
excellent. During trial III, we sequenced larger numbers of PCR 
products corresponding to TEF3, the 60S L10 (L1) and TEF1a 
to test the performance of our primers in standardised Sanger 
sequencing applications. Our results convinced us that capil- 
lary sequencing and production of high quality trace files did 
not require major modifications for the tested PCR primers or 
amplification/sequence parameters. For all amplicons assessed 
by Sanger sequencing, the sequences obtained were identical 
to those recorded in the original in silico inferred alignments for 
the respective target species. 


The designated primer pairs for TEF2 (alignment groups 45, 48) 
gave very poor results and were excluded from further testing. 
The primer pair for alignment group 28 (TEF7a) initially gave 
good results, but performed poorly in later experiments and 
was excluded from further testing. However, in common with 
the ‘untested’ pairs, the primers for TEF2 and TEF 1a (align- 
ment group 28) may be amenable to species-specific redesign 
(Table 6; Al28 = EF1_alternative_3f, EF1_alternative_3r), and 
thus represent interesting ‘seed’ candidates for further experi- 
ments. Best performing primers (Table 6), were distributed to 
collaborating colleagues for independent testing on taxon sets 
(Table 1). 


Amplification efficiencies 


Optimal visualisation of multivariate data matrices, showing 
relative proportions of ‘within’ and ‘between’ categories (i.e. 
datasets vs primers), was achieved with the ‘mosaic plot’ func- 
tion using ‘lattice’ in R (Sarkar 2008), with our categorical vari- 
ables encoded as typical ‘survival data’. Each plot is sectioned 
in rows representing datasets and columns representing primer 
pairs separated by primers designed by the complete genome 
approach (Fig. 11a, b) and primers designed by the Pfam ap- 
proach (Fig. 11c, d). Horizontal expansion of boxes indicates 
the ratio of PCR reactions within a ‘dataset’ or taxonomic group 
relative to the global quantity (proportion) of all other PCR val- 
ues (outcome), and vertical expansion of boxes indicates the 
ratio of individuals within quantity (proportion) of PCR values 
(outcome) of a specific primer test (e.g. ITS). Horizontal lines 
indicate value ‘zero’. When a line with a dot is on the left, it 
equals ‘zero’ positives (= no amplification), and when the line 
with a dot is on the right, it indicates ‘zero’ negatives (= 100 % 
amplification). Darkly shaded boxes indicate proportion of 
amplifications, relative to no amplification, which are shown 
by pale grey shaded boxes. Mosaic plots outperform other 
plot types in visualising large quantities of data and provide 
a global overview on data proportions only. The frequencies 
of positive and negative amplification for each primer pair are 
shown in Fig. 11b and d. 


The mosaic plot shows that the overall amplification testing 
between laboratories to test the primers designed through the 
complete genome approach (Robert et al. 2011) was strongly 
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Fig. 3 Structural gene map and positional primer locations for y-actin. 
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Fig. 4 Structural gene map and positional primer locations for fungal RNA polymerase subunit II. 
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Fig. 5 Structural gene map and positional primer locations for fungal translation elongation factor-1a. 
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Fig. 8 Structural gene map and positional primer locations for fungal TOP7 primarily based on Penicillium chrysogenum genome. 
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Fig. 10 Structural gene map and positional primer locations for fungal LNS2 primarily based on Penicillium chrysogenum genome. 


biased towards novel primer pairs, revealed by the many 
horizontal lines in some datasets corresponding to the ACT, 
TUB2, TEF1a(AFTOL primers), ITS, LSU, and RPB2 pairs (Fig. 
11a). Data corresponding to these pairs was mainly obtained 
from PCR experiments performed at CBS, and lack of data 
from collaborating laboratories can simply be explained by 
time and financial constraints. However, the results of these 
amplification experiments provide an important benchmark, 
even when inconsistently assessed, and valuable information 
on the universal performance of certain pairs. 


Promising results were obtained for several amplicons. The 
TEF 1a region, with the AFTOL pair (EF1-983F/EF1-1567R), 
enjoyed relatively consistent amplification for different taxon 
sets, with exceptions being relatively poor results for the Onyge- 
nales and rock-inhabiting Teratosphaeriaceae (80 % average). 


Results for sections of the fungal rDNA operon, especially the 
complete ITS and partial LSU (D1/D2 domains), were high 
in efficiency for all taxon sets, 92 % and 91 %, respectively 
(Fig. 11b). These results confirm the outstanding potential of 
ITS as primary universal fungal DNA barcode; it performs 
exceptionally well under standard laboratory conditions. No 
particular distinction was made between the possible forward 
primer combinations using ITS5/ITS1/ITS1-F. Similarly, LSU 
had very similar amplification efficiencies with LROR/LRS5, and 


augments the useful species-level signal ITS with increased 
phylogenetic information. 


The results for RPB2 (fRPB2-5f/fRPB2-7cR) were less encou- 
raging; only five taxon sets were tested for a single primer 
combination and the global performance of RPB2 can only be 
partly judged from our data. 


Two very commonly known gene sections widely employed in 
fungal phylogenetics (Fig. 2, 3), ACT and TUB2, had relatively 
consistent amplification efficiencies among datasets (mostly 
grey shaded boxes), showing that these pairs qualify well for 
universal amplification (84 % ACT and 86 % TUB2 on average, 
respectively). Unfortunately, as is obvious from Table 2, we 
tested different primer pairs for these genes, and no universal 
performing set was found (CA5R/CA14 for ascomycetous 
yeasts Il only; Bsens/Brev for Polyporaceae only, amplification 
success data not shown). The problem is apparent with asco- 
mycetous yeasts group II, where experiments were performed 
with CA5R and CA14 instead of ACT-512F/ACT-783R, and the 
entirely negative results of the latter PCRs were excluded from 
the analysis. A similar issue was observed in tests prior to this 
study for a Polyporaceae taxon set (data not shown), where 
only Bsens and Brev resulted in positive amplification (Lesage- 
Meessen et al. 2011). Surprisingly amplification of TUB2 (using 
Btub2Fd/Btub4Rd) in ascomycetous yeasts group II always 
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resulted in a single PCR product, but these amplicons always 
yielded noisy trace files. B-Tubulins are known to be duplicated, 
particularly in basidiomycetous fungi (Ayliffe et al. 2001, Zhao et 
al. 2014). The inconsistent PCR efficiencies and taxon-specific 
primer pairs disqualify ACT and TUB2 as universal barcodes. 


Overall amplification efficiencies of primer pairs were inconsist- 
ently recorded in different laboratories and our conclusions on 
universality remain speculative. The most comprehensive re- 
sults were retrieved for five of our newly designed primer pairs, 
EF1-1018F/EF1-1620R, EF1-1002F/1688R, 60S-908R/60S- 
506F, EF3-3185F/EF3-3538R and EF3-3188F/EF3-3984R. 
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Fig.12 Barcode gap analysis for ITS, PGK, TOP1, LNS2, TUB2 and TEF1. 


Performance of these three gene targets, as visualised in Fig. 
11, varied among these pairs, but successful amplification 
proved that predictions of universal primer binding sites were 
accurately computed. Of these, the TEF1a primer candidates 
tested consistently among all labs. The poorest performance 
(57 % average) occurred with EF1-1002F/1688R, which had 
low efficiencies for all ascomycetous and basidiomycetous 
yeasts, and for the Ceratocystis, Onygenales, Penicillium 2, 
Russula and Teratosphaeriaceae taxon sets. In contrast, the 
TEF1a pair EF1-1018F/EF1-1620R had the highest fidelity for 
all taxon sets and among all tested protein coding genes with 
an average amplification success of 88 % (Fig. 11), almost 
equal to ITS (92 %) and LSU (91 %). As a second best among 
our novel primer pairs, we identified a pair corresponding to 
a gene encoding the 60S ribosomal protein L10 (L1), previ- 
ously unused for barcoding or phylogenetic purposes. This 
primer pair had an average success rate of 77 %, with local 
optima over 95 %, and performed poorly only for some asco- 
mycetous and basidiomycetous yeasts, and the Coniothyrium, 
Mycosphaerellaceae and Onygenales sets (~40—60 %, Fig. 
11). The novel region TEF3, although a fungal-specific gene, 
could not be universally amplified with high success rates, with 
neither the short nor long sections yielding satisfactory results 
for fidelity (short EF3-3185F/EF3-3538R, 68 % on average; 
long EF3-3188F/EF3-3984R, 52 % on average). Nevertheless, 
promising local optima for both TEF3 pairs were observed. 
The long section could be retrieved (~70—90 % success) for 
the Sordariales (Colletotrichum, Ceratocystis, Scedosporium), 
Hypocreales (Fusarium s.str., Hypocreales s.lat.), the Dothidio- 
mycetes (Mycosphaerellaceae) and Eurotiales (Penicillium 1, 
2). The shorter TEF3 section was amplifiable for a broader 
spectrum of fungi with relatively high success rates, including 
the Sordariales, Hypocreales, Onygenales, Eurotiales and even 
within the basidiomycetes (~70—90 % success). Despite this, 
failure to retrieve products for other taxa, such as the yeasts 
sensu lato, strongly decreased the universal efficiency of both 
TEF3 primers. We did not globally test a third TEF3 pair, EF3- 
3186F/EF3-3984R2, because of poor performance in the trial 
Ill. Nevertheless some taxon specific efficiency, in particular 
for Fusarium (Hypocreales) and Scedosporium, (Sordariales) 
were recorded. 


For the primers designed by the Pfam domain approach (Lewis 
et al. 2011), PGK533 primer pair performed the best for this 
locus, generating fragments of about 1 kb with a success rate 
about as good as ITS with a wide range of fungi, except for most 
Basidiomycetes (Fig. 11c, d). The TOP7 primer pair amplified 
an 800 bp fragment efficiently for most ascomycetes but per- 
formed rather poorly for lower fungi and most basidiomycetes, 
the notable exception being Pucciniomycotina which are very 
challenging for ITS (Fig. 11c, d). LNS2 generated a short frag- 
ment (less than 400 bp) but amplified efficiently in most lower 
fungi and in all basidiomycetes. 


Barcode gap analyses 


As expected, many closely related species have a pairwise 
distance of zero (Fig. 12). PGK, TOP1 and LNS2 all show a 
higher resolution than ITS. Our data shows that PGK and TOP1 
are as good as TUB2 or TEF 7a to resolve closely related Peni- 
cilium and Fusarium species, respectively (data not shown). 
There is insufficient data to make such statement for LNS2 as 
testing for ascomycetes was less of a priority than testing for 
basidiomycetes. 


Multi-dimensional scaling of sequence data 

Results of the Mantel test (Smouse et al. 1986), comparing 
pairwise distance matrices of: i) the ‘global 502 taxa dataset’ 
(Fig. 13a, b, i); ii) ‘Fusarium’ (Fig. 13c, d, g); and iii) ‘Onygenales’ 
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Fig. 13 Multi-dimensional scaling of rescaled cophenetic correlation coefficients comparing distance matrices. 
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Table 8 Rescaled cophenetic correlation coefficients from Mantel test comparing distance matrices. 


Gene/Dataset Gene 


Onygenales 60S Actin All TEF1a TEF3 ITS LSU 

60S 1 0.869922 0.945842 0.665818 0.741716 0.878008 0.828932 
Actin 0.869922 1 0.86677 0.718523 0.709779 0.833556 0.7986 
All 0.945842 0.86677 1 0.703723 0.801272 0.947347 0.879415 
TEF1a 0.665818 0.718523 0.703723 1 0.58968 0.656179 0.654457 
TEF3 0.741716 0.709779 0.801272 0.58968 1 0.739432 0.751655 
ITS 0.878008 0.833556 0.947347 0.656179 0.739432 1 0.822984 
LSU 0.828932 0.7986 0.879415 0.654457 0.751655 0.822984 1 
Fusarium 60S All TEF1a TEF3 ITS LSU 

60S 1 0.833529 0.748277 0.564056 0.717349 0.818429 

All 0.833529 1 0.770656 0.773935 0.766443 0.737515 

TEF1a 0.748277 0.770656 1 0.502797 0.616714 0.687664 

TEF3 0.564056 0.773935 0.502797 1 0.523543 0.532899 

ITS 0.717349 0.766443 0.616714 0.523543 1 0.645249 

LSU 0.818429 0.737515 0.687664 0.532899 0.645249 1 

Global 60S All TEF1a TEF3 ITS LSU 

60S 1 0.678678 0.621512 0.68173 0.88979 0.748801 

All 0.678678 1 0.77138 0.754324 0.698097 0.694224 

TEF1a 0.621512 0.77138 1 0.572193 0.68485 0.635149 

TEF3 0.68173 0.754324 0.572193 1 0.724653 0.670825 

ITS 0.88979 0.698097 0.68485 0.724653 1 0.778558 

LSU 0.748801 0.694224 0.635149 0.670825 0.778558 1 


(Fig. 13e, f, h) by pairwise cophenetic coefficients of correla- 
tion, were plotted (numerically rescaled 0—1) in n-dimensional 
space. Rescaled cophenetic coefficients for each dataset are 
shown in Table 8. Assessment of global performance employing 
a comparison of TEF1a distances matrices (with sequences 
obtained from pair EF1-1018F/EF1-1620R; AFTOL pair EF 1- 
983/EF1-1567 combined), TEF3 (sequences obtained from pair 
EF3-3185F/EF3-3538R, EF3-3188F/EF3-3984R combined), 
60S, ITS and LSU for 502 taxa, showed that the first two axes 
contributed more than 75 % of the total information and the first 
three axes over 90 %. While 60S and ITS were rendered closely 
together and apart from LSU, neither TEF1a nor TEF3 were 
correlated closely to one of the rDNA matrices (Fig. 13a, b). 
The most significant result from visualising the cophenetic 
coefficients of correlation is apparent in Fig. 13a and i. The 3D 
plot and hierarchical clustering indicated that TEF1a correlated 
optimally with the overall concatenated (ALL) data matrix. Scor- 
ing of individual genes (matrices) is equivalent to distances as 
obvious from the dendrogram in Fig. 13i. The comparison of 
the ‘Fusarium’ matrices showed the same trend with the first 
three axes describing the vast majority of variance. Our results 
show that optimal correlation of the global dataset (ALL) was 
achieved with the 60S matrix (Fig. 13c, d, g), and the universal 
barcode ITS correlated poorly with the concatenated (ALL) ob- 
ject. The impact of the second dimension is obvious, because 
TEF3 is more distant from the concatenated matrix than ITS in 
the second but not the first dimension. Therefore, the weighted 
impact of dimensions causes the performance of TEF3 to 
appear inferior to ITS, presumably an artefact of distance 
over true character-based (cladistic/phylogenetic) inference 
(Felsenstein 2003). This could also be related to comparisons 
of cophenetic coefficients employing the Mantel test (Harmon 
& Glor 2010). Visualisation of the results from comparing data 
matrices obtained from the ‘Onygenales’ dataset contradicted 
those of the ‘502 taxon dataset’. We observed that the combi- 
nation of two genes, ITS and TEF3, resulted in the ordination 
highly resembling the one of the concatenated (ALL) matrix. 
The impact of the third dimension scaled the LSU matrix more 
distantly than the two latter genes in Fig. 13e but was less pro- 
nounced in Fig. 13f. Second closest, and in addition to the two 
previous datasets, ACT was inferred as inferior to 60S and ITS 


when compared to the overall dataset. Both elongation factor 
matrices, TEF1a and TEF3, were inferior to ITS and were both 
distant from the overall concatenated matrix (ALL). Hierarchical 
clustering of individual matrices performance against the overall 
dataset is shown in Fig. 13h. In general, TEF 7a performed the 
best among all ‘502’ investigated taxa when comparing 5+1 
gene datasets, which was biased by local optima. The best 
candidate genes for Fusarium (60S) and Onygenales (ITS and 
60S) rendered one of our new candidates as an optimal ‘local’ 
barcode for these two sets. The pairwise comparison for each 
dataset and each gene is given in Table 8 by rescaled (0—1) 
cophenetic correlation coefficients, with key results reflected 
by MDS analysis (Fig. 13) above. 


DISCUSSION 


DNA-barcoding: a conceptual overview 


In the early 1960s some milestone papers by Edward & Cavalli- 
Sforza (1964), Zuckerkandl & Pauling (1965) and Hennig (1965) 
gave rise to the novel field now known as molecular phylogene- 
tics (Felsenstein 2003), including some decades later ‘DNA bar- 
coding’. The goal of DNA barcoding is to apply high-throughput 
DNA sequencing technology to large-scale screening of one or a 
few reference genes to identify unknown specimens to species, 
and enhance discovery of new species (Hebert et al. 2003a, b, 
Stoeckle 2003). Proponents of ‘DNA-based taxonomy’ envisage 
the development of comprehensive databases of reference se- 
quences, centred on voucher specimens or living cultures that 
represent all described species, against which sequences from 
newly sampled individuals can be compared. Given the long 
history of use of molecular markers for this purpose for iden- 
tifying microbes (e.g. 16S rDNA for prokaryotes, Avise 2004), 
there is nothing fundamentally novel about DNA barcoding as 
applied in microbiology and mycology today, except for the 
increased scale and proposed standardisation. Standardisation 
by the selection of one or more reference genes is critical and 
stimulates large-scale phylogenetic analyses, but whether ‘one 
gene fits it all’ is still an open debate. 


Initial reactions to the coordinated DNA barcoding movement 
ranged from great enthusiasm, especially among ecologists 
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(Janzen 2004), to condemnation, largely from traditional taxo- 
nomists. The proponents emphasized its application to modern 
meta-barcoding studies, increasing the precision and efficiency 
of field studies of rare, diverse and difficult to identify taxa 
across all kingdoms of organisms, while DNA barcoding is not 
possible without large-scale and well curated sequence data- 
bases (Schoch et al. 2014), founded in traditional systematics 
and referencing authoritatively identified voucher specimens. 
DNA barcoding should not be confused or confounded with 
efforts to resolve the ‘tree of life’ or establish large-scale 
phylogenies. DNA barcodes often have limited phylogenetic 
resolution, reflecting the use of a single-gene assay in an at- 
tempt to identify an entity to species or reveal inconsistencies 
between molecular variation and current species delimitations. 
Thus, while barcoding initiatives use similar knowledge and 
techniques, resolving phylogenies from species to major higher 
clade levels requires a different strategy for gene selection. 


Irrespective of which kingdom of organisms is targeted, the 
conceptual character and constraints of DNA barcoding always 
remain identical. We need to separate the two applications: 
i) molecular diagnostics of individuals relative to described taxa; 
and ii) DNA-enabled discovery of new species. Both are inher- 
ently phylogenetic and rely on a solid taxonomic background, 
including adequate sampling of variation within species and 
inclusion of all previously described extant species within a 
given genus or clade. Accurate species diagnoses rely on 
careful examination of intra-species variation compared with 
that observed between species. So a short DNA sequence 
(barcode) will allow reliable allocation of an individual to a 
described taxon. Such identifications should be accompanied 
by a clear statement of statistical support such as branch sup- 
port in a phylogenetic analysis. This view is in agreement with 
best taxonomic practice that is acceptance of name changes 
or subdivisions should be based on multiple lines of evidence. 
As a Strictly designated concept to species identification and 
discovery, a key theory of DNA barcoding is testing of ‘bounda- 
ries’ between taxa established by prior solid taxonomic research 
(Hebert et al. 2003a, b, Hebert & Gregory 2005, Meyer & Paulay 
2005, Schindel & Miller 2005, Schoch et al. 2012). 


One fungus, which genes? 


In this study, we identified alternative universal gene targets for 
the fungal kingdom from available fully sequenced genomes 
(data processed in 2011) using taxonomy-aware in silico pipe- 
lines. Two approaches were used to infer potential novel candi- 
dates and corresponding primer pairs. Each approach yielded 
promising, but different candidate barcodes. 


The search for the ideal gene, using the genomic approach and 
alignment templates of Robert et al. (2011) was based on five 
criteria: i) presence in the majority of investigated genomes; 
ii) potential ease of PCR amplification and Sanger DNA 
sequencing; iii) restriction to putative single copy orthologs, 
constrained to genes harbouring conserved sites < 1 Kbp 
apart; iv) the potential to accurately delimit fungal species; 
and eventually v) an inherent phylogenetic signal. Often, the 
distance between conserved nucleotide stretches was too 
long, or conserved sites were not present across a sufficient 
proportion of the analysed genomes. Primers were tested and 
marker sequences we sequenced were the best compromise 
between technical feasibility, and a holistic ‘one-gene-fits-all’ 
barcode locus. We identified known genes and some not yet 
used for either barcoding or phylogenetic applications, result- 
ing in a long list of candidate primers. While most alignment 
templates were functionally annotated (e.g. TEF1a), only 14 of 
54 were identified as putatively novel candidate regions with no 
known track record in fungal DNA barcoding or phylogenetics. 


Although TEF 1a is already well known to provide superior sub- 
ordinate taxon resolution in some groups of Ascomycetes such 
as Trichoderma and Fusarium, and it is sufficiently present in 
public repositories, there was still a need to improve the uni- 
versality of primers. From the genomic studies, TEF1a qualified 
as the only universal barcode candidate. Some of the newly 
identified genes through the Pfam approach, in particular TOP1 
and PGK, which were tested more intensively for sequencing 
and analyses of barcode gaps, show great promise, especially 
for ascomycetes, where both performed well in genera with 
particularly narrow barcode gaps in the ITS, i.e. Penicillium 
and Fusarium. 


In general, slowly evolving genes (e.g. TEF1a, RPB2 exons) 
are more suitable for inference of deep phylogenetic relation- 
ships, while genes with higher evolutionary rates (e.g. ITS, 
TUB2) reflect more recent evolutionary and speciation events. 
Thus, genes or gene sections reflecting both of these charac- 
teristics are most suitable as barcodes if phylogenetic signal 
is considered important. Fragments of protein-coding genes 
can potentially meet both requirements, either by incorporating 
more variable intronic and more conserved exonic sequences, 
or by providing one level of variability as nucleotide sequences 
and a more conserved suite for phylogenies as amino acid 
sequences. Our results show that TEF 7a is likely to be one of 
the most ideal candidates for non-rDNA barcodes because of its 
balanced ‘trade-off’s’, in particular when it comes to a broader 
community acceptance of extant users, and versatility among 
important fungal orders. 


Finding primer candidates to amplify a standardised DNA frag- 
ment with high phylogenetic information content in distantly 
related species was difficult with the computational resources 
available in 2011 when our research was conducted. In our 
in silico search of available genomes of dikaryotic fungi, only 
seven primer pairs covered 72 species of the 74 species 
studied. For ascomycetes, six primer pairs were identified as 
compatible with 57 species studied and 186 primer pairs with 
56 species. The universality of our best-performing primers 
in silico was experimentally verified, at least for Asco- and 
Basidiomycota. Some primer candidates were located almost 
adjacent to each other with only a few bases difference at 
either the 5’ or 3’ primed ends. No new candidates with high 
local optima such as the MCM7 and Tsr1 primers (Schmitt et 
al. 2009) were discovered. 


From the genomic scans, the fragment of the translation elong- 
ation factor 1a (TEF1a) gene appears to be the most promising 
candidate as a universal secondary barcode. We remain cau- 
tious about fidelity for basal fungal lineages but we believe our 
primers for TEF1a are the most optimal compromise between 
universal taxon applicability and fidelity. In our study we have 
particularly tested fungal taxa of economic or applied impor- 
tance. We foresee that a high fidelity secondary DNA barcode 
would be included in a very large portion of ongoing taxonomic 
studies. Based on our results and anticipating community ac- 
ceptance, we propose the use of primers EF1-1018F and EF 1- 
1620R (corresponding to TEF 1a) as secondary universal DNA 
barcode primer pair for the fungal kingdom. Given the results of 
this extensive four-year multi-lab project, we are convinced that 
fungal DNA barcoding with ITS and TEF 7a for the identification 
of unknown or validation of ‘primary-barcoded’ specimens will 
increase the resolution and lead to the development of more 
complete, structured datasets. Global consensus might also 
speed up the discovery and description of novel taxa in a 
standardised way. 


Another critical question is whether the proposal of a second- 
ary DNA barcode actually is still relevant in a world where next 
generation sequencing is becoming routine in many laborato- 
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ries, at a cost that is comparable to Sanger sequencing? The 
costs for sequencing a complete fungal genome are dropping 
rapidly. Despite this, the fungal taxonomic community lacks the 
computational resources and capacities to routinely process 
complete genomes for specimen identification or for phyloge- 
netic studies, especially in the developing world. Undoubtedly, 
the rapid increase of publicly available fungal genomes will 
enhance our understanding of evolutionary processes govern- 
ing niche adaption, mating or gene family expansion related to 
pathogenicity and other functional capabilities. Genome plastic- 
ity, architecture and high genome sequence identity between 
sister species will force us to rethink ‘tree thinking’ (Bapteste 
et al. 2005). Despite this, genome sequencing and its subse- 
quent Big Data explosion have yet to establish a solid basis for 
molecular species identification, in particular when there is low 
sequence coverage per taxon, insufficient sampling of species 
diversity or, reciprocally, when complete genomic sequences 
may require an updated classification. 


Rank inflation, i.e. the tendency to reclassify taxa at higher taxo- 
nomic levels (such as a previously recognised family as an 
order) is currently commonplace in fungal taxonomy, and re- 
quires a more standardised approach. Agreement on conserved 
universal protein coding markers, such as TEF1a, that would 
allow a more precise nesting of unknown fungi among known 
higher level taxa will be valuable. 


DNA barcoding has great impact on human and animal health 
diagnostics. We hope that this paper, with its detailed gene 
maps and large quantity of tested and yet to be tested primers, 
will serve as a reference for expanded barcoding (e.g. further 
testing of the fungal specific translation elongation factor 3 in 
a clinical setting). It seems unlikely that there will be a similar 
study of this magnitude, employing PCR, Sanger DNA sequenc- 
ing and testing so many universal primers on a large collection 
of fungal taxa. Therefore, we believe that the knowledge ob- 
tained during this study would convince the community to ratify 
TEF1a as secondary barcode in order to ensure its rapid ap- 
plication with the improved primer system, but also to promote 
the research on other target genes for other lineages of fungi. 
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